R for Chemistry Data Analysis and Chemometrics
Chemometrics with R
Jordi Cuadros
January 2022

Introduction and References

Chemometrics

“Chemometrics is the chemical discipline that uses mathematical, statistical, and other methods employing formal logic to design or select optimal measurement procedures and experiments, and to provide maximum relevant chemical information by analyzing chemical data.”

Vékey, K., Telekes, A., & Vertes, A. (Eds.). (2011). Medical applications of mass spectrometry. Elsevier.

Main Chemometrics Applications

  • Design of experiments (before data acquisition, mind works better than software here)
  • Chemical data preprocessing
  • Multivariate exploratory analysis
  • Classification
  • Modelling
  • Multiple Curve Resolution

R packages for Chemistry and Chemometrics

Updated information can be found following the CRAN Task View: Chemometrics and Computational Physics, (https://cran.r-project.org/web/views/ChemPhys.html).

There are many other packages that will useful for specific tasks. These will appear along the session.

Chemometrics with R – References

Chemical Data Preprocessing

Before Starting, some Package Management

pckgs<-c("tidyverse","ggthemes","GGally",
         "ggalt","gtools","gridExtra")
pckgs2Install<-pckgs[!(pckgs %in% library()$results[,1])]
pckgs2Load<-pckgs[!(pckgs %in% (.packages()))]
for(pckg in pckgs2Install) {
  install.packages(pckg,
    repos="https://cloud.r-project.org/",
    quiet=TRUE, type="binary")
}
for(pckg in pckgs2Load) {
  library(pckg, character.only = TRUE)
}

From Raw Data to Processed Data

Many times raw data needs to be processed to be analyzed and compared to preexisting data.

Common processing includes

  • Noise removal
  • Baseline adjustment
  • Peak alignment
  • Binning
  • Peak picking
  • Scaling
  • Managing missing data
  • Removing outliers
  • Comparing complex data

In the case of chemical data, this is specially true is for spectral data.

Spectral Data

There are different types of spectral data

  • Raw/Profile Data
  • Clean Data (with some adjustments)
  • Binned Data
  • Centroided Data
  • Peak Lists
    • With Intensity
    • Without Intensity

Data Sets

Let’s explore this process with several data sets:

  • a gas chromatogram from ptw::gaschrom,
  • a NIR spectra from pls::gasoline,
  • a set of NIR of wood samples from the RNIR web book, https://guifh.github.io/RNIR/index.html,
  • an extract from a LC-MS from the msdata package, and
  • a MALDI-TOF spectra from a milk mixture available in the baseline package.
if(!require("ptw")) {
  install.packages("ptw", repos="https://cloud.r-project.org/")
  library("ptw")
}
data("gaschrom")
help("gaschrom", package="ptw")
gchrom <- data.frame(rt=1:5000,
                     I=gaschrom[1,])
plot(gchrom, type="l")

if(!require("pls")) {
  install.packages("pls", repos="https://cloud.r-project.org/")
  library("pls")
}
data(gasoline)
help("gasoline", package="pls")
gasolineNIR <- data.frame(wl=seq(900,1700,by=2),
                          logRef=gasoline$NIR[4,])
plot(gasolineNIR, type="l")

woodNIR <- get(load("../data/mydataPHAZIR.Rdata"))

spectra <- as.matrix(woodNIR$NIR)
dfSpectrum <- data.frame(wl = colnames(spectra),
                         Abs = as.numeric(spectra[1,])) %>% 
  mutate(wl=as.numeric(substr(wl,2,nchar(wl))))

plot(dfSpectrum, type="l")

if (!require("BiocManager", quietly = TRUE))
  install.packages("BiocManager",
                   repos="https://cloud.r-project.org/")
if (!require("msdata", quietly = TRUE)) {
  BiocManager::install("msdata", update=FALSE)
  library("msdata")
}

dir(system.file("sciex", package = "msdata"))
if (!require("mzR", quietly = TRUE)) {
  BiocManager::install("mzR", update=FALSE)
  library("mzR")
}

sciexMS <- mzR::openMSfile(
  system.file("sciex/20171016_POOL_POS_1_105-134.mzML",
              package = "msdata"))
sciexSpec <- mzR::spectra(sciexMS)
length(sciexSpec)
## [1] 931
dim(sciexSpec[[1]])
## [1] 578   2
plot(sciexSpec[[1]], type="l")

plot(sciexSpec[[931]], type="l")

if(!require("baseline")) {
  install.packages("baseline",
                   repos="https://cloud.r-project.org/")
  library("baseline")
}
data("milk")
str(milk)
## 'data.frame':    45 obs. of  2 variables:
##  $ cow    : num  0 0.25 0.375 0.875 0.5 0.75 0.5 0.125 0 0.125 ...
##  $ spectra: num [1:45, 1:21451] 1029 371 606 368 554 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr [1:21451] "4999.94078628963" "5001.55954267662" "5003.17856106153" "5004.79784144435" ...
##  - attr(*, "terms")=Classes 'terms', 'formula'  language cow ~ spectra
##   .. ..- attr(*, "variables")= language list(cow, spectra)
##   .. ..- attr(*, "factors")= int [1:2, 1] 0 1
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:2] "cow" "spectra"
##   .. .. .. ..$ : chr "spectra"
##   .. ..- attr(*, "term.labels")= chr "spectra"
##   .. ..- attr(*, "order")= int 1
##   .. ..- attr(*, "intercept")= int 1
##   .. ..- attr(*, "response")= int 1
##   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. ..- attr(*, "predvars")= language list(cow, spectra)
##   .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "nmatrix.21451"
##   .. .. ..- attr(*, "names")= chr [1:2] "cow" "spectra"
help("milk")

milkMS <- data.frame(mz = as.numeric(colnames(milk$spectra)),
                     I = milk$spectra[20,])
plot(milkMS, type="l")

Noise Removal

Main methods for noise removal are:

Running medians, threshold, aggregating repetitions and consensus approaches are also used.

Noise Removal – Moving Average

milkMS_mave <- milkMS
milkMS_mave$I <- as.numeric(stats::filter(milkMS_mave$I, 
              rep(1,13)/13))

g1 <- ggplot(milkMS, aes(x=mz, y=I)) + 
  geom_line() +
  theme_classic()
g2 <- ggplot(milkMS_mave, aes(x=mz, y=I)) + 
  geom_line() +
  theme_classic()
grid.arrange(g1,g2)

g1 <- ggplot(milkMS, aes(x=mz, y=I)) + 
  geom_line() +
  coord_cartesian(xlim=c(10000,11000),ylim=c(0,500)) +
  theme_classic()
g2 <- ggplot(milkMS_mave, aes(x=mz, y=I)) + 
  geom_line() +
  coord_cartesian(xlim=c(10000,11000),ylim=c(0,500)) +
  theme_classic()
grid.arrange(g1,g2)

Noise Removal – Savitzky-Golay

milkMS_sg7 <- milkMS
milkMS_sg7$I <- as.numeric(stats::filter(milkMS_mave$I, 
              c(-2,3,6,7,6,3,-2)/21))

g1 <- ggplot(milkMS, aes(x=mz, y=I)) + 
  geom_line() +
  coord_cartesian(xlim=c(10000,11000),ylim=c(0,500)) +
  theme_classic()
g2 <- ggplot(milkMS_sg7, aes(x=mz, y=I)) + 
  geom_line() +
  coord_cartesian(xlim=c(10000,11000),ylim=c(0,500)) +
  theme_classic()
grid.arrange(g1,g2)

Noise Removal – FFT

I_fft <- fft(milkMS$I)
I_fft[513:(length(I_fft)-512)] <- 0
milkMS_fft <- data.frame(mz = milkMS$mz,
        I=Re(fft(I_fft, inverse=TRUE)/length(I_fft)))

g1 <- ggplot(milkMS, aes(x=mz, y=I)) + 
  geom_line() +
  theme_classic()
g2 <- ggplot(milkMS_fft, aes(x=mz, y=I)) + 
  geom_line() +
  theme_classic()
grid.arrange(g1,g2)

g1 <- ggplot(milkMS, aes(x=mz, y=I)) + 
  geom_line() +
  coord_cartesian(xlim=c(10000,15000),ylim=c(0,4000)) +
  theme_classic()
g2 <- ggplot(milkMS_fft, aes(x=mz, y=I)) + 
  geom_line() +
  coord_cartesian(xlim=c(10000,15000),ylim=c(0,4000)) +
  theme_classic()
grid.arrange(g1,g2)

Baseline Adjustment

Baseline adjustment is commonly done by finding local minima and fitting a curve to them. This curve is then substracted to the spectrum.

Common methods for baseline adjustment are:

  • Multiplicative Scatter Correction (MSC), for NIR, and
  • Asymmetric Least Squares (ALS).

MSC is available in the pls package. ALS and many other methods, such as LOWESS or a polynomial fit, are implemented in the baseline package.

Baseline Adjustment - MSC

originalspectra <- as.matrix(woodNIR$NIR)
newspectra<-msc(originalspectra)            
mydataOrig <- data.frame(id=woodNIR[,5], NIR = originalspectra) %>% 
  pivot_longer(cols=-1, names_to = "wl", values_to = "Abs") %>% 
  mutate(wl=as.numeric(substr(wl,6,nchar(wl))))
mydataMSC <- data.frame(id=woodNIR[,5], NIR = newspectra) %>% 
  pivot_longer(cols=-1, names_to = "wl", values_to = "Abs") %>% 
  mutate(wl=as.numeric(substr(wl,6,nchar(wl))))             

g1 <- ggplot(mydataOrig, aes(x=wl, y=Abs, color=as.factor(id))) + 
  geom_line() +
  theme_classic() + theme(legend.position = "none") 
g2 <- ggplot(mydataMSC, aes(x=wl, y=Abs, color=as.factor(id))) + 
  geom_line() +
  theme_classic() + theme(legend.position = "none") 
grid.arrange(g1,g2)

Baseline Adjustment - ALS

milkMS_als <- milkMS
sp <- baseline.als(t(milkMS_als$I), lambda=10)
milkMS_als$I <- sp$corrected[1,]

g1 <- ggplot(milkMS, aes(x=mz, y=I)) + 
  geom_line() +
  theme_classic()
g2 <- ggplot(milkMS_als, aes(x=mz, y=I)) + 
  geom_line() +
  theme_classic()
grid.arrange(g1,g2)

Peak Alignment

Different phenomena can provoke misalignments among spectra. In order to compare them peaks must be aligned. In many cases, this is experimentally done with standards or solved through binning; but in some cases, data processing is required, for example in LC chromatograms.

Shifts can be corrected using time warping functions, for example the ptw or the dtw functions in the package with the same names.

gchrom1 <- data.frame(rt=1:5000,
                     I=gaschrom[1,])
gchrom2 <- data.frame(rt=1:5000,
                     I=gaschrom[16,])

ggplot(gchrom1,aes(x=rt, y=I)) +
  geom_line(aes(y=I+50), data=gchrom2, color="red") +
  geom_line() +
  theme_classic()

wcc(gchrom1[,2], gchrom2[,2], length(gchrom1[,2]))
## [1] 0.9980289
res <- ptw(gchrom1[,2], gchrom2[,2], init.coef=c(0,1,0))
summary(res)    # WCC is a distance
## PTW object: individual alignment of 1 sample on 1 reference.
## 
## Warping coefficients:
##           [,1]     [,2]          [,3]
## [1,] -29.22414 1.029991 -1.041421e-05
## 
## Warping criterion: WCC
## Warping mode: forward
## Value: 0.01627874
gchrom2c <- data.frame(rt=gchrom1$rt,
                       I=res$warped.sample[1,])
ggplot(gchrom1,aes(x=rt, y=I)) +
  geom_line(aes(y=I+50), data=gchrom2c, color="green") +
  geom_line() +
  theme_classic()

Binning

Especially in MS, binning is a required process to allow for spectra comparison.

dfMS1 <- data.frame(mz = sciexSpec[[1]][,1],
                    I = sciexSpec[[1]][,2])
dfMS2 <- dfMS1 %>% mutate(mz=round(2*mz)/2) %>% 
  group_by(mz) %>% summarize(I=sum(I),.groups="drop")
dfMS3 <- dfMS1 %>% mutate(mz=round(mz/2)*2) %>% 
  group_by(mz) %>% summarize(I=sum(I),.groups="drop")

g1 <- ggplot(dfMS1, aes(x=mz, xend=mz,y=0, yend=I))+
  geom_segment() + theme_classic()
g2 <- ggplot(dfMS2, aes(x=mz, xend=mz,y=0, yend=I))+
  geom_segment() + theme_classic()
g3 <- ggplot(dfMS3, aes(x=mz, xend=mz,y=0, yend=I))+
  geom_segment() + theme_classic()

grid.arrange(g1,g2,g3)

Peak Picking

It is also common in spectra processing to convert the profile data to a peak list.

This is usually done looking for local maxima and/or applying some threshold consideration.

str(dfMS2)
## tibble [41 x 2] (S3: tbl_df/tbl/data.frame)
##  $ mz: num [1:41] 105 107 108 108 109 ...
##  $ I : num [1:41] 412 1236 4121 412 2884 ...
dfMS2[dfMS2$I/max(dfMS2$I)>.05,]
## # A tibble: 3 x 2
##      mz      I
##   <dbl>  <dbl>
## 1   123 100588
## 2   124 646800
## 3   125  62351
span <- 100
dfMS1_g <- embed(dfMS1$I,span)
dfMS1_gmax <- span + 1 - apply(dfMS1_g,1,which.max)
dfMS1_gmax[dfMS1_gmax==1 | dfMS1_gmax==span] <- NA
dfMS1_peaksPos <- dfMS1_gmax + 0:(length(dfMS1_gmax)-1)
dfMS1_peaksPos <- unique(na.omit(dfMS1_peaksPos))
dfMS1$mz[dfMS1_peaksPos]
##  [1] 114.0914 122.0956 123.0919 124.0860 125.0903 125.0966 126.4697 126.5886
##  [9] 126.6474 126.7394 126.7760 126.9332 126.9571 126.9666 128.1406 129.9880

Scaling

In many cases, scaling can change the outcomes of an analysis, so it is important to decide what scaling makes sense. Some examples can be

  • Scaling towards an internal reference,
  • Scaling towards the maximum signal,
  • Scaling towards the total area, or
  • Not scaling at all.

Outliers and Missing Data Imputation

Although analysis of leverage points, dtection of outliers and imputation of missing data are always relevant, we will not discuss it further here. Just keep in mind there is always a necessary step in any data analysis.

Data Matching

All this process is many times done to compare chemical entities: spectra, molecules, mixtures… In any of this cases, decisions will need to be made in order to decide when to data match and/or how distance must be calculated between two data instances.

Some examples of strategies for matching data are

  • Fingerprints
  • Layered notations, like InChI,
  • Hashed codes, like InChIKey or SPLASH
  • Correlation coefficients,
  • Distances on the principal components space.

Multivariant Exploratory Analysis

Introduction and Data

Whether it comes from spectral data or not, it is common to have multivariant data in chemical analysis. In fact, chemometrics is commonly restricted to the studya and analysis of multivariant data.

Common multivariant techniques in chemometrics are

  • Exploratory anlysis
  • Classification
  • Modelling
  • Multiple Curve Resolution

We will discuss here some aspects of multivariant exploratory analysis.

We’ll use the classical Forina’s wines dataset, which is included in the kohonen package.

if(!require("kohonen")) {
  install.packages("kohonen",
                   repos="https://cloud.r-project.org/")
  library("kohonen")
}

data("wines")
dfWines <- data.frame(wines)
str(dfWines)
## 'data.frame':    177 obs. of  13 variables:
##  $ alcohol          : num  13.2 13.2 14.4 13.2 14.2 ...
##  $ malic.acid       : num  1.78 2.36 1.95 2.59 1.76 1.87 2.15 1.64 1.35 2.16 ...
##  $ ash              : num  2.14 2.67 2.5 2.87 2.45 2.45 2.61 2.17 2.27 2.3 ...
##  $ ash.alkalinity   : num  11.2 18.6 16.8 21 15.2 14.6 17.6 14 16 18 ...
##  $ magnesium        : num  100 101 113 118 112 96 121 97 98 105 ...
##  $ tot..phenols     : num  2.65 2.8 3.85 2.8 3.27 2.5 2.6 2.8 2.98 2.95 ...
##  $ flavonoids       : num  2.76 3.24 3.49 2.69 3.39 2.52 2.51 2.98 3.15 3.32 ...
##  $ non.flav..phenols: num  0.26 0.3 0.24 0.39 0.34 0.3 0.31 0.29 0.22 0.22 ...
##  $ proanth          : num  1.28 2.81 2.18 1.82 1.97 1.98 1.25 1.98 1.85 2.38 ...
##  $ col..int.        : num  4.38 5.68 7.8 4.32 6.75 5.25 5.05 5.2 7.22 5.75 ...
##  $ col..hue         : num  1.05 1.03 0.86 1.04 1.05 1.02 1.06 1.08 1.01 1.25 ...
##  $ OD.ratio         : num  3.4 3.17 3.45 2.93 2.85 3.58 3.58 2.85 3.55 3.17 ...
##  $ proline          : num  1050 1185 1480 735 1450 ...
summary(dfWines[,1:5])
##     alcohol        malic.acid        ash        ash.alkalinity 
##  Min.   :11.03   Min.   :0.74   Min.   :1.360   Min.   :10.60  
##  1st Qu.:12.36   1st Qu.:1.60   1st Qu.:2.210   1st Qu.:17.20  
##  Median :13.05   Median :1.87   Median :2.360   Median :19.50  
##  Mean   :12.99   Mean   :2.34   Mean   :2.366   Mean   :19.52  
##  3rd Qu.:13.67   3rd Qu.:3.10   3rd Qu.:2.560   3rd Qu.:21.50  
##  Max.   :14.83   Max.   :5.80   Max.   :3.230   Max.   :30.00  
##    magnesium     
##  Min.   : 70.00  
##  1st Qu.: 88.00  
##  Median : 98.00  
##  Mean   : 99.59  
##  3rd Qu.:107.00  
##  Max.   :162.00
summary(dfWines[,6:10])
##   tot..phenols     flavonoids    non.flav..phenols    proanth     
##  Min.   :0.980   Min.   :0.340   Min.   :0.1300    Min.   :0.410  
##  1st Qu.:1.740   1st Qu.:1.200   1st Qu.:0.2700    1st Qu.:1.250  
##  Median :2.350   Median :2.130   Median :0.3400    Median :1.550  
##  Mean   :2.292   Mean   :2.023   Mean   :0.3623    Mean   :1.587  
##  3rd Qu.:2.800   3rd Qu.:2.860   3rd Qu.:0.4400    3rd Qu.:1.950  
##  Max.   :3.880   Max.   :5.080   Max.   :0.6600    Max.   :3.580  
##    col..int.     
##  Min.   : 1.280  
##  1st Qu.: 3.210  
##  Median : 4.680  
##  Mean   : 5.055  
##  3rd Qu.: 6.200  
##  Max.   :13.000
summary(dfWines[,11:13])
##     col..hue        OD.ratio        proline      
##  Min.   :0.480   Min.   :1.270   Min.   : 278.0  
##  1st Qu.:0.780   1st Qu.:1.930   1st Qu.: 500.0  
##  Median :0.960   Median :2.780   Median : 672.0  
##  Mean   :0.957   Mean   :2.604   Mean   : 745.1  
##  3rd Qu.:1.120   3rd Qu.:3.170   3rd Qu.: 985.0  
##  Max.   :1.710   Max.   :4.000   Max.   :1680.0

Correlations

corWines <- data.frame(cor(dfWines)) %>% 
  rownames_to_column("var1") %>% 
  pivot_longer(2:14, names_to="var2", values_to = "cor") %>% 
  filter(var1<var2) %>% 
  arrange(desc(abs(cor)))
str(corWines)
## tibble [78 x 3] (S3: tbl_df/tbl/data.frame)
##  $ var1: chr [1:78] "flavonoids" "flavonoids" "OD.ratio" "flavonoids" ...
##  $ var2: chr [1:78] "tot..phenols" "OD.ratio" "tot..phenols" "proanth" ...
##  $ cor : num [1:78] 0.864 0.786 0.7 0.65 0.641 ...
head(corWines)
## # A tibble: 6 x 3
##   var1       var2           cor
##   <chr>      <chr>        <dbl>
## 1 flavonoids tot..phenols 0.864
## 2 flavonoids OD.ratio     0.786
## 3 OD.ratio   tot..phenols 0.700
## 4 flavonoids proanth      0.650
## 5 alcohol    proline      0.641
## 6 proanth    tot..phenols 0.611
tail(corWines)
## # A tibble: 6 x 3
##   var1           var2            cor
##   <chr>          <chr>         <dbl>
## 1 magnesium      malic.acid -0.0490 
## 2 magnesium      OD.ratio    0.0470 
## 3 col..int.      proanth    -0.0271 
## 4 ash.alkalinity col..int.   0.0205 
## 5 ash            proanth     0.00808
## 6 ash            OD.ratio    0.00150
if(!require("corrplot")) {
  install.packages("corrplot",
                   repos="https://cloud.r-project.org/")
  library("corrplot")
}

corrplot(cor(wines))

corrplot(cor(wines), order="hclust")

ggpairs(dfWines, axisLabels = "none",
        lower = list(continuous = wrap("points",
              alpha = 0.3, shape = 21, size = 1))) + 
  theme_classic()

Factor Analysis

It consists in searching for a limited number of factor that can conserve most of the data variability.

The most common way to perform this analysis is

  • Principal Components Analysis (PCA), followed by
  • Rotation

Factor Analysis – PCA Assumptions

if(!require("REdaS")) {
  install.packages("REdaS",
                   repos="https://cloud.r-project.org/")
  library("REdaS")
}
REdaS::KMOS(dfWines)  # Kaiser-Meyer-Olkin Sample Adequacy
## 
## Kaiser-Meyer-Olkin Statistics
## 
## Call: REdaS::KMOS(x = dfWines)
## 
## Measures of Sampling Adequacy (MSA):
##           alcohol        malic.acid               ash    ash.alkalinity 
##         0.7251534         0.7965202         0.4383797         0.6790194 
##         magnesium      tot..phenols        flavonoids non.flav..phenols 
##         0.6667702         0.8722492         0.8138751         0.8208894 
##           proanth         col..int.          col..hue          OD.ratio 
##         0.8535643         0.6213451         0.7878646         0.8644313 
##           proline 
##         0.8138733 
## 
## KMO-Criterion: 0.7767008
REdaS::bart_spher(cor(dfWines))  # Bartlett's Sphericty Test
## 
##  Bartlett's Test of Sphericity
## 
## Call: REdaS::bart_spher(x = cor(dfWines))
## 
##      X2 = 396.936
##      df = 78
## p-value < 2.22e-16

Factor Analysis – PCA

pcaWines <- princomp(dfWines, cor=TRUE)
summary(pcaWines)
## Importance of components:
##                           Comp.1    Comp.2    Comp.3     Comp.4     Comp.5
## Standard deviation     2.1628220 1.5815708 1.2055413 0.96148018 0.92829777
## Proportion of Variance 0.3598307 0.1924128 0.1117946 0.07111109 0.06628744
## Cumulative Proportion  0.3598307 0.5522435 0.6640381 0.73514919 0.80143663
##                            Comp.6     Comp.7     Comp.8     Comp.9    Comp.10
## Standard deviation     0.80302411 0.74295478 0.59223207 0.53775461 0.49679842
## Proportion of Variance 0.04960367 0.04246014 0.02697991 0.02224462 0.01898528
## Cumulative Proportion  0.85104030 0.89350044 0.92048035 0.94272497 0.96171025
##                           Comp.11    Comp.12     Comp.13
## Standard deviation     0.47480542 0.41033745 0.322412350
## Proportion of Variance 0.01734155 0.01295206 0.007996133
## Cumulative Proportion  0.97905180 0.99200387 1.000000000
str(pcaWines)
## List of 7
##  $ sdev    : Named num [1:13] 2.163 1.582 1.206 0.961 0.928 ...
##   ..- attr(*, "names")= chr [1:13] "Comp.1" "Comp.2" "Comp.3" "Comp.4" ...
##  $ loadings: 'loadings' num [1:13, 1:13] 0.13789 -0.24638 -0.00432 -0.23738 0.135 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:13] "alcohol" "malic.acid" "ash" "ash.alkalinity" ...
##   .. ..$ : chr [1:13] "Comp.1" "Comp.2" "Comp.3" "Comp.4" ...
##  $ center  : Named num [1:13] 12.99 2.34 2.37 19.52 99.59 ...
##   ..- attr(*, "names")= chr [1:13] "alcohol" "malic.acid" "ash" "ash.alkalinity" ...
##  $ scale   : Named num [1:13] 0.807 1.116 0.274 3.327 14.134 ...
##   ..- attr(*, "names")= chr [1:13] "alcohol" "malic.acid" "ash" "ash.alkalinity" ...
##  $ n.obs   : int 177
##  $ scores  : num [1:177, 1:13] 2.23 2.53 3.75 1.02 3.05 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr [1:13] "Comp.1" "Comp.2" "Comp.3" "Comp.4" ...
##  $ call    : language princomp(x = dfWines, cor = TRUE)
##  - attr(*, "class")= chr "princomp"
biplot(pcaWines)

biplot(pcaWines, choices=c(1,3))

screeplot(pcaWines)

Factor Analysis – Rotation

rfa <- varimax(pcaWines$loadings[,1:3], eps=1e-14)
rfa
## $loadings
## 
## Loadings:
##                   Comp.1 Comp.2 Comp.3
## alcohol                   0.538       
## malic.acid        -0.224  0.137 -0.221
## ash                0.179  0.162 -0.656
## ash.alkalinity           -0.198 -0.628
## magnesium          0.135  0.285 -0.165
## tot..phenols       0.409  0.122       
## flavonoids         0.446              
## non.flav..phenols -0.219        -0.254
## proanth            0.338              
## col..int.         -0.203  0.512       
## col..hue           0.342 -0.213       
## OD.ratio           0.430 -0.112       
## proline            0.169  0.442  0.101
## 
##                Comp.1 Comp.2 Comp.3
## SS loadings     1.000  1.000  1.000
## Proportion Var  0.077  0.077  0.077
## Cumulative Var  0.077  0.154  0.231
## 
## $rotmat
##            [,1]      [,2]       [,3]
## [1,]  0.9244318 0.2184311  0.3125918
## [2,] -0.1295040 0.9508070 -0.2814155
## [3,] -0.3586844 0.2196676  0.9072440
tfa <- promax(pcaWines$loadings[,1:3], m=4)
tfa
## $loadings
## 
## Loadings:
##                   Comp.1 Comp.2 Comp.3
## alcohol                   0.546       
## malic.acid        -0.231  0.139 -0.188
## ash                0.170        -0.682
## ash.alkalinity           -0.244 -0.629
## magnesium          0.132  0.262 -0.184
## tot..phenols       0.412              
## flavonoids         0.450              
## non.flav..phenols -0.225        -0.222
## proanth            0.340              
## col..int.         -0.210  0.525       
## col..hue           0.349 -0.234       
## OD.ratio           0.436 -0.146       
## proline            0.170  0.436       
## 
##                Comp.1 Comp.2 Comp.3
## SS loadings     1.023  1.013  1.023
## Proportion Var  0.079  0.078  0.079
## Cumulative Var  0.079  0.157  0.235
## 
## $rotmat
##            [,1]      [,2]       [,3]
## [1,]  0.9380762 0.1661701  0.1789950
## [2,] -0.1411449 0.9414354 -0.2598980
## [3,] -0.3507245 0.3153516  0.9609461
if(!require("GPArotation")) {
  install.packages("GPArotation",
                   repos="https://cloud.r-project.org/")
  library("GPArotation")
}
obfa <- GPArotation::oblimin(pcaWines$loadings[,1:3])
obfa
## Oblique rotation method Oblimin Quartimin converged.
## Loadings:
##                    Comp.1  Comp.2  Comp.3
## alcohol            0.0231  0.5398  0.0666
## malic.acid        -0.2459  0.1492 -0.1997
## ash                0.0975  0.1298 -0.6809
## ash.alkalinity    -0.0878 -0.2141 -0.6116
## magnesium          0.1227  0.2692 -0.1960
## tot..phenols       0.4063  0.0874 -0.0833
## flavonoids         0.4438  0.0249 -0.0607
## non.flav..phenols -0.2536 -0.0675 -0.2208
## proanth            0.3312  0.0456 -0.0951
## col..int.         -0.1890  0.5254 -0.0574
## col..hue           0.3434 -0.2381  0.0653
## OD.ratio           0.4237 -0.1460 -0.0329
## proline            0.1979  0.4292  0.0553
## 
## Rotating matrix:
##        [,1]  [,2]   [,3]
## [1,]  0.966 0.151  0.186
## [2,] -0.129 0.952 -0.316
## [3,] -0.228 0.272  0.931
## 
## Phi:
##          [,1]   [,2]     [,3]
## [1,]  1.00000 0.0387 -0.00713
## [2,]  0.03868 1.0000  0.01911
## [3,] -0.00713 0.0191  1.00000

Clustering

The most common clustering strategies are

  • Hierarchical clustering, especially for small data sets, and
  • K-means, for large data sets but consider selecting the features before clustering.

Clustering - Hierarchical

dfWinesScaled <- data.frame(scale(dfWines))
clWines_hca <- hclust(stats::dist(dfWinesScaled,
                                  method="euclidean"),
                      method="ward.D2")
str(clWines_hca)
## List of 7
##  $ merge      : int [1:176, 1:2] -9 -131 -11 -15 -92 -34 -16 -164 -20 -22 ...
##  $ height     : num [1:176] 1.16 1.19 1.21 1.22 1.24 ...
##  $ order      : int [1:177] 158 159 153 175 176 167 171 164 172 156 ...
##  $ labels     : NULL
##  $ method     : chr "ward.D2"
##  $ call       : language hclust(d = stats::dist(dfWinesScaled, method = "euclidean"), method = "ward.D2")
##  $ dist.method: chr "euclidean"
##  - attr(*, "class")= chr "hclust"
winesGroups_hca <- cutree(clWines_hca, k=3)
table(winesGroups_hca)
## winesGroups_hca
##  1  2  3 
## 63 58 56
if(!require("ggdendro")) {
  install.packages("ggdendro",
                   repos="https://cloud.r-project.org/")
  library("ggdendro")
}
ggdendrogram(clWines_hca, labels=FALSE) +
  geom_hline(yintercept=9.2, color="red") +
  theme(axis.text.x = element_blank())

Clustering - K-means

dfWinesScaled <- data.frame(scale(dfWines))
clWines_km <- kmeans(dfWinesScaled, centers=3)
str(clWines_km)
## List of 9
##  $ cluster     : int [1:177] 3 3 3 3 3 3 3 3 3 3 ...
##  $ centers     : num [1:3, 1:13] 0.174 -0.918 0.833 0.864 -0.395 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:3] "1" "2" "3"
##   .. ..$ : chr [1:13] "alcohol" "malic.acid" "ash" "ash.alkalinity" ...
##  $ totss       : num 2288
##  $ withinss    : num [1:3] 326 559 382
##  $ tot.withinss: num 1268
##  $ betweenss   : num 1020
##  $ size        : int [1:3] 51 65 61
##  $ iter        : int 2
##  $ ifault      : int 0
##  - attr(*, "class")= chr "kmeans"
winesGroups_km <- clWines_km$cluster
table(winesGroups_km)
## winesGroups_km
##  1  2  3 
## 51 65 61
if(!require("mclust")) {
  install.packages("mclust",
                   repos="https://cloud.r-project.org/")
  library("mclust")
}
adjustedRandIndex(winesGroups_hca,winesGroups_km)
## [1] 0.8518456

dfWinesClust <- cbind(dfWines,
                      hca=factor(winesGroups_hca),
                      km=factor(winesGroups_km))
g1 <- ggplot(dfWinesClust,
    aes(x=flavonoids, y=alcohol, color=hca)) + 
  geom_point(shape=3) + geom_encircle() + theme_classic() +
  theme(legend.position = "top")
g2 <- ggplot(dfWinesClust,
    aes(x=flavonoids, y=alcohol, color=km)) + 
  geom_point(shape=3) + geom_encircle() + theme_classic() +
  theme(legend.position = "top")

grid.arrange(g1,g2,ncol=2)

Self-Organizing Maps

YOUR TURN

  1. Explore the data from mango aroma components available at https://figshare.com/articles/dataset/Mango_Mangifera_indica_Aroma_Discriminate_Cultivars_and_Ripeness_Stages/14303913. Warning: it needs some cleaning!

Classification

MORE TO COME

Discriminant Analysis

kNN

Classification – Other ML Approaches

Classification Trees & Forests

SVN

Neural Networks

Modelling

Calibration Models

Multiple Regression

Step-wise Regression

Regression Trees

PCR/PLS

Ridge-Regression

Non-linear Regression

Structural Equation Models

Multiple Curve Resolution